Please check out our Shiny App: Happy Spotify for interactive visualizations.
Spotify is an audio streaming platform that provides access to over 50 million tracks. As of October 2019, the platform had 248 million monthly active users, including 113 million paying subscribers. Spotify publishes a daily ranking of songs in each region, which makes it possible to look for regular patterns in how people listen to music.
We want to explore the following four aspects:
Are there any trends in songs’ streams over time?
Do people show a special preference for some audio features?
Is there anything thought-provoking about singers? For example, do they stick to particular genres? Do they share similarities with other singers?
What about the lyrics? Are there any interesting common words across different songs?
We used open-source data from Spotify, and collected the features of each song using its API. We got the list of songs to focus on by scraping data from Spotify Charts, which contains the daily top 200 songs.
Besides, we got the connections among singers using the related-artists data from Spotify. Because of the large number of artists, we only selected the top 100 singers as our dataset and studied their network.
Since Spotify itself does not offer track lyrics, we scraped lyrics from Genius. Since we do not know the track ID in Genius, we had to use the track name and artist name to match a song and get its lyrics. Songs that could not be properly matched were ignored.
Also, due to the size of the dataset (there are too many songs right now!), it would be impossible to carefully analyze every song on the list. We focused on the top songs and the most popular singers.
Yichi Liu collected the daily ranking data of different countries, Rui Bai collected connections between artists, and Yuchen Pei collected lyrics of top songs.
Daily ranking data: 2837664 records of 23 variables.
track_id, Track.Name, Artist, Date, Region, artists, artists_IDs, Artist_ID.Position, Streams, key, mode, duration_ms, time_signature, danceability, energy, loudness, speechiness, acousticness, instumentalness, liveness, valence, tempo.
Yearly ranking data: 6300 records of 22 variables.
track_id, Track.Name, Artist, Region, artists, artists_IDs, Artist_ID.Position, Streams, key, mode, duration_ms, time_signature, danceability, energy, loudness, speechiness, acousticness, instumentalness, liveness, valence, tempo.
Global singer connection: an adjacency matrix with 67 nodes.
Lyrics: 1552 records of 2 variables.
id, lyric.
Our raw data are quite messy. One song may have multiple track ids for different albums, and some albums were even multi-labeled. Hence, we need to locate the data by both Artist and Track.Name instead of only by id. Moreover, since Spotify records songs from all over the world, many track names and artist names contain Greek characters or text in other languages. This was problematic when we scraped other information based on the track name and artist name, and those characters would also cause meaningless word clouds in our later analysis. We therefore substituted Greek characters with English characters and dropped the data that remained meaningless.
After scraping the daily data from Spotify Charts, we only had information about the track name, track id and artists. We wanted to get more information behind the songs, so for each song we retrieved its audio features by its track id.
Also, since there are many songs involving collaboration, artist_id may contain multiple ids. Thus, we extracted the id for the main artist.
The existing charts like Billboard and playlists for the top songs in Spotify are for 2018. Since we are already at the end of 2019, we want to get the latest one. Thus, we generated the yearly ranking of the songs in different countries during a year by adding up the daily streams of each song and ordering them by their total streams.
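The aggregation described above can be sketched in a few lines of Python. This is only an illustration, not the code used for this project: the row layout (dictionaries keyed by Region, Track.Name, Artist and Streams) is an assumption, and the toy rows are invented. Songs are keyed by track name and artist rather than by track id, since one song may carry several track ids.

```python
from collections import defaultdict

def yearly_ranking(daily_rows):
    """Sum daily streams per (region, track, artist) and rank within each region."""
    totals = defaultdict(int)
    for row in daily_rows:
        key = (row["Region"], row["Track.Name"], row["Artist"])
        totals[key] += row["Streams"]

    by_region = defaultdict(list)
    for (region, track, artist), streams in totals.items():
        by_region[region].append((track, artist, streams))

    # Order each region's songs by total streams, highest first.
    return {
        region: sorted(songs, key=lambda s: s[2], reverse=True)
        for region, songs in by_region.items()
    }

# Toy rows, invented for illustration only.
rows = [
    {"Region": "global", "Track.Name": "A", "Artist": "X", "Streams": 10},
    {"Region": "global", "Track.Name": "B", "Artist": "Y", "Streams": 25},
    {"Region": "global", "Track.Name": "A", "Artist": "X", "Streams": 20},
]
ranking = yearly_ranking(rows)
print(ranking["global"])
```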
After scraping lyrics from Genius, since there are too many non-English characters in track names or artist names, we replaced them with corresponding English characters and then dropped additional information between parentheses or after horizontal bars.
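A minimal sketch of this kind of name cleaning, assuming that "horizontal bars" means a " - " separator in the track name. The helper name `clean_title` is hypothetical. Unicode decomposition only strips accents from Latin letters, so it does not transliterate Greek or other scripts; those would need a separate mapping.

```python
import re
import unicodedata

def clean_title(name):
    """Normalize a track or artist name before matching it against Genius."""
    # Decompose accented letters (e.g. "ñ" becomes "n" plus a combining tilde),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop anything in parentheses, e.g. "(with Justin Bieber)".
    no_parens = re.sub(r"\s*\([^)]*\)", "", unaccented)
    # Keep only the part before the first " - " separator (assumed meaning of "horizontal bar").
    head = no_parens.split(" - ")[0]
    # Collapse whitespace and lowercase for case-insensitive matching.
    return re.sub(r"\s+", " ", head).strip().lower()

print(clean_title("Happy Xmas (War Is Over) - Remastered"))
```

Hyphens without surrounding spaces, as in "Spider-Man", are left untouched by this rule.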
For the daily data, there are six missing patterns in total, and most records have no missing values. We noticed that some records have no track name or artist_id, and we found the reason by looking at the website. For example, for the records with a missing track_id, we checked the original website and found that there was indeed no song information for them, so their features could not be merged either. Some other records have a missing track name because of a website error. We dropped those records.
We used the global yearly ranking data (Top 100 from Nov.1 2018 to Oct.31 2019) to get an overview of the most popular songs and singers during this period.
Top songs
After summing up the daily streams of each song, we got the top 10 songs as follows.
The table below shows that Sunflower, a song from the soundtrack of the movie Spider-Man: Into the Spider-Verse, ranked first. This movie was the first animated feature film in the Spider-Man franchise. After the film was released on Dec. 14, 2018, it received praise worldwide for its animation, characters, story, voice acting, humor and soundtrack. Thus, it is no surprise that Sunflower was so popular last year.
| Rank | Track | Artist | Stream |
|---|---|---|---|
| 1 | Sunflower - Spider-Man: Into the Spider-Verse | Post Malone | 1066802067 |
| 2 | bad guy | Billie Eilish | 907019009 |
| 3 | thank u, next | Ariana Grande | 902085812 |
| 4 | 7 rings | Ariana Grande | 883900704 |
| 5 | Señorita | Shawn Mendes | 875370863 |
| 6 | Shallow | Lady Gaga | 788314764 |
| 7 | Without Me | Halsey | 738657161 |
| 8 | Happier | Marshmello | 731706076 |
| 9 | Wow. | Post Malone | 721656407 |
| 10 | I Don’t Care (with Justin Bieber) | Ed Sheeran | 710241659 |
Let’s consider top songs from another aspect. If a song stayed on the ranking for a long time, it should also count as a popular song. We calculated the number of days that each song stayed in the Top 100 using the global daily dataset and drew a Cleveland dot plot. Since we only cared about “top” songs, we dropped the songs that stayed on the ranking for fewer than 100 days.
There were 23 songs that stayed in the Global Top 100 for the whole of last year. Although few of them ranked high, they were enduringly popular.
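Counting days on the chart can be sketched as follows. This is an illustration under assumptions, not the project's code: each row is assumed to carry a Date and a numeric chart rank in a field called Position, which is a hypothetical name.

```python
from collections import defaultdict

def days_on_chart(daily_rows, top_n=100, min_days=100):
    """Count distinct days each song spent in the top_n, keeping songs with at least min_days."""
    days = defaultdict(set)
    for row in daily_rows:
        if row["Position"] <= top_n:  # "Position" is an assumed column name
            days[(row["Track.Name"], row["Artist"])].add(row["Date"])
    # Using a set of dates avoids double counting when a song has several track ids.
    counts = {song: len(dates) for song, dates in days.items()}
    return {song: n for song, n in counts.items() if n >= min_days}

# Toy rows, invented for illustration only.
rows = [
    {"Date": "2019-01-01", "Track.Name": "A", "Artist": "X", "Position": 1},
    {"Date": "2019-01-02", "Track.Name": "A", "Artist": "X", "Position": 150},
    {"Date": "2019-01-03", "Track.Name": "A", "Artist": "X", "Position": 99},
    {"Date": "2019-01-01", "Track.Name": "B", "Artist": "Y", "Position": 100},
]
print(days_on_chart(rows, top_n=100, min_days=2))
```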
Top singers
In the Top 100 songs, there were 67 singers in total. We defined top singers as those artists who had more than one popular song and drew a bar chart for them.
Among the 17 top singers, the most popular one was Post Malone, who had 6 songs in Top 100, while Billie Eilish, Ed Sheeran and XXXTENTACION each had 4 songs.
It is interesting to find that although Ariana Grande had two hugely popular songs, she did not have the most songs on the ranking. Therefore, when we judge how popular an artist is, we should consider not only how popular their works are but also how many popular works they have.
For further investigation, we picked the top 4 singers including Post Malone, Billie Eilish, Ed Sheeran and XXXTENTACION.
Similarly, we studied the total number of days of each artist on the daily ranking last year. 25 singers stayed in the ranking for the whole year, including those 4 top singers we discussed above. Now it should be safer to draw the conclusion that these people are indeed the most popular singers worldwide (based on Spotify’s data).
Before we start to analyze our data, we first need to understand the meaning of each audio feature. Explanations are from Spotify API.
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
instrumentalness: Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Values typically range between -60 and 0 dB.
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording, the closer to 1.0 the attribute value.
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
To get an overview of the data, it is important to find out how much people listen every day. Since we could not get the actual daily total streams across all songs, we computed the total streams of the Top 200 songs each day. The top songs are representative of what people listened to and make up the major part of the actual streams. Hence, this is a reasonable approximation.
As shown in the graph above, there is a clear cyclical trend. By hovering the mouse over the line, we found that adjacent local peaks are always 7 days apart, which suggests a weekly cycle. Thus, we further faceted by day of the week and found that the average total streams are much higher on Friday than on Sunday. The average rises over the weekdays and drops on the weekend. This shows that people listen to more music on weekdays, especially Friday, and less on weekends.
This looks surprising at first glance, since we would expect people to prefer enjoying themselves on weekends, for example by listening to music. One possible reason is that people have many more choices for relaxation on weekends. They can use their entire spare time to go camping, watch movies and spend time with family and friends. Although people’s desire for relaxation remains the same on weekdays, they only have fragmented spare time, and listening to music seems to be the best way to relax in it. Also, people enjoy Happy-Friday-Nights, so there is a clear peak on Friday.
Furthermore, although no seasonal preference for music shows up in the graph, the total streams on Dec 24th, Christmas Eve, are far higher than on any other day. This pattern does not appear for any other festival. Why is Christmas Eve an exception? Many famous Christmas songs come to mind. Christmas is a festival of songs! To verify our hypothesis, we took a look at the top songs on Christmas Eve.
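The weekday comparison can be reproduced with a short Python sketch. It assumes the daily totals are already available as a mapping from ISO date strings to the summed Top 200 streams. The numbers below are invented; they are not values from our data.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def mean_streams_by_weekday(daily_totals):
    """Average the daily total streams separately for each day of the week."""
    buckets = defaultdict(list)
    for day, total in daily_totals.items():
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        name = WEEKDAYS[date.fromisoformat(day).weekday()]
        buckets[name].append(total)
    return {name: sum(values) / len(values) for name, values in buckets.items()}

# Invented totals: two Fridays and one Sunday in March 2019.
toy_totals = {"2019-03-01": 100, "2019-03-03": 60, "2019-03-08": 120}
weekday_means = mean_streams_by_weekday(toy_totals)
print(weekday_means)
```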
| Rank | Track | Artist | Streams |
|---|---|---|---|
| 1 | All I Want for Christmas Is You | Mariah Carey | 10819009 |
| 2 | Last Christmas | Wham! | 9098668 |
| 3 | Santa Tell Me | Ariana Grande | 7086794 |
| 4 | It’s Beginning to Look a Lot like Christmas | Michael Bublé | 6877219 |
| 5 | Jingle Bell Rock | Bobby Helms | 6040533 |
| 6 | It’s the Most Wonderful Time of the Year | Andy Williams | 5960727 |
| 7 | Rockin’ Around The Christmas Tree | Brenda Lee | 5768868 |
| 8 | Happy Xmas (War Is Over) - Remastered | John Lennon | 5692945 |
| 9 | Do They Know It’s Christmas? - 1984 Version | Band Aid | 5497071 |
| 10 | Wonderful Christmastime [Edited Version] - Remastered 2011 / Edited Version | Paul McCartney | 5040731 |
Most of the top songs are indeed songs for Christmas. Christmas could remind people of those songs, which contributes to the high streams.
Regardless of the overall trend, is there a typical popularity trend once a song enters the chart? We drew a line chart of each song’s streams against the number of days since it entered the chart. By clustering the songs by trend, we concluded that there are 3 types of popular songs, and we drew the trend for one typical song in each group.
Type 1: falling. Those songs were listened to the most when they had just entered the chart, and their ranking fell as time went by. The popularity of those songs is usually related to the reputation of the artist.
Type 2: rising before falling. They ranked high on the chart for a long time before falling. This might be because the songs are of good quality. Regardless of the effect of the singers, people just loved the songs.
Type 3: rising. The total streams of those songs kept rising, and their rank moved to the top and stayed there. Although they were not expected to be great songs, their great melodies engaged people.
As concluded above, there is no strict popularity trend for a song. Some of them are popular at the beginning and fall soon while some of them are preferred by more and more people. It’s all about the quality of the song. So don’t be sad if the song is not popular at once and don’t be overconfident that the song will always be popular!
In order to analyze track features, we first need to know the distribution of each feature, including danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo. We standardized the features so that they have the same scale to compare. The following boxplot shows their distribution.
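As an illustration of the standardization step, the sketch below applies a z-score (subtract the mean, divide by the standard deviation) to one feature. That z-scoring is the exact method used here is an assumption, and the sample values are invented.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Invented danceability values for three songs.
z = standardize([0.2, 0.4, 0.6])
print(z)
```

After this transformation every feature has mean 0 and unit variance by construction, so a boxplot of the standardized features compares medians, spread and outliers rather than means.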
It is clear that danceability, duration_ms, energy, loudness, tempo and valence were approximately evenly distributed around 0, which means people did not show a special preference for these features. Acousticness, liveness and speechiness all had a median below 0, along with some outliers, indicating that people generally prefer songs with low values of these features. However, this does not necessarily mean that songs with high values, for example higher liveness, could not be popular.
For instrumentalness, we found that most of its values were below 0, while some were very large. Recalling its definition (a predictor of whether a track contains no vocals), this can be explained: most popular songs contain vocal content, resulting in low instrumentalness, and the outliers represent popular instrumental tracks.
We are curious about whether there are correlations between pairs of these features, so we calculated their correlation matrix and drew a heatmap. Only loudness and energy have a strong positive correlation, with a coefficient of 0.763, while the other features seem to have no clear correlations.
Moreover, we are interested in the influence of the features on the ranking of a track. From the heatmap, feature values show no clear correlation with yearly rank. This finding is useful: since listeners do not seem to favor particular track features, artists do not need to deliberately change their style to cater to public taste.
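For reference, each cell of such a correlation matrix is a Pearson coefficient between two feature columns. The sketch below computes one coefficient in plain Python. The energy and loudness values are invented and are not expected to reproduce the 0.763 reported above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Invented values: louder tracks (closer to 0 dB) paired with higher energy.
energy = [0.2, 0.5, 0.8]
loudness = [-12.0, -7.0, -4.0]
r = pearson(energy, loudness)
print(round(r, 3))
```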
A line chart is drawn for each feature over the year, after rescaling each series so that it starts from 100.
The plot showed an obvious exception on Dec 25th, 2018. On that day, people preferred songs with high acousticness and loudness and low danceability, energy and speechiness. Since loudness is positively correlated with energy, as noted above, the result looks odd. As mentioned before, people loved Christmas songs at Christmas, for example “White Christmas” and “It’s Beginning to Look a Lot like Christmas”. We checked the features of the Christmas songs and found that they indeed have this special combination of features.
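One common way to make every series start from 100 is to divide each value by the first value and multiply by 100. The sketch below assumes that this is the rescaling meant here; the input values are invented.

```python
def index_to_100(series):
    """Rescale a series so that its first value becomes 100."""
    base = series[0]
    if base == 0:
        raise ValueError("cannot index a series whose first value is 0")
    return [100 * value / base for value in series]

# Invented daily averages of one feature.
indexed = index_to_100([2.0, 3.0, 1.0])
print(indexed)
```

With this convention, a feature with negative values such as loudness needs care: a quieter (more negative) day yields an index above 100, so the direction of the line is reversed relative to the raw values.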
The interactive choropleth plot is available in the Shiny App. By selecting different features in the App, we find that the distribution of the features also varies among countries.
For example, comparing the distributions of danceability and liveness, although people in Brazil do not prefer songs that are more suitable for dancing, they love live music.
More findings are in the Shiny App. For example, based on the distribution of valence and energy, South American listeners love positive and intense songs more than North American listeners do. Turkish listeners preferred instrumental music to spoken-word music.
In summary, although features have no direct correlation with the ranking, different song features are preferred at different times and in different countries.
After standardizing the features, we got the radar chart of every singer as follows.
It’s easy to find that all their songs had low instrumentalness except for Billie Eilish’s “Bad Guy” and “Bury a Friend”.
Post Malone’s songs had high loudness, energy and danceability. Unlike his other songs, “Wow.” had relatively high speechiness since it is a rap song. And since “Sunflower” is a collaboration, if we ignore its influence, we can see that the valence of his songs is not very high.
Ed Sheeran’s songs had high loudness, energy and danceability, and low tempo and liveness. Except for “Shape of You”, his top songs had low acousticness. The valence of “Perfect” is low because it is a romantic ballad written about his fiancée.
The radar chart of Billie Eilish is very interesting since the features of her songs varied widely. The four songs all had a tempo within the medium range. “Wish You Were Gay” is a song with extremely high danceability and liveness. “When the Party’s Over” is a song with extremely high acousticness and low liveness, energy and loudness. “Bury a Friend” had high acousticness, speechiness and instrumentalness. “Bad Guy” had high speechiness and relatively high valence. This indicates that Billie Eilish is not limited to one particular style.
For XXXTENTACION, the features of his songs were somewhat like those of Post Malone. His songs had relatively high loudness, energy and danceability. The tempo of his songs was not very high, but speechiness was high.
The songs of the top singers are great examples to explore the specific peaks in the streams for the songs. We drew the time series for those who own Top 100 songs in the year. The trends of streams for each song are shown in the graph below.
Post Malone: There are two significant peaks in the graph. To find out why the peaks exist, we looked up the dates and the songs. On Dec 14th, 2018, the movie “Spider-Man: Into the Spider-Verse” was released, and “Sunflower”, one of its soundtrack songs, was played more than usual. On Sep 6th, 2019, he released a new album, “Hollywood’s Bleeding”, which included several existing popular songs, so there was a large peak on that day.
Billie Eilish: There are three significant peaks. On Feb 1st, 2019 and March 29th, 2019, she released a new single, “bury a friend”, and a new album, “When We All Fall Asleep, Where Do We Go?”, which contributed to the two peaks, respectively. There was also a peak on July 11th, 2019 without a new album. At that time she held a concert in CA, which may have drawn listeners to her songs.
Ed Sheeran: Songs like “Shape of You” and “Perfect” were always on the chart. They were released years ago, which means that they are songs people keep coming back to. There was a large peak on May 10th, 2019. On that day, the song “I Don’t Care” was released, his second collaboration with Justin Bieber after 4 years. They are both famous singers and fans had long expected the collaboration, which caused a sharp rise. Also, as happened to the others, the streams spiked when his new album came out on July 12th.
In summary, peaks in the popularity of songs are often triggered by events. Big events contribute to a sudden rise in streams.
After getting the related singers of each artist in the Top 100, we constructed a network of these artists. Similar singers are connected by an edge, where similarity is based on analysis of the Spotify community’s listening history.
We found that many singers do not share similarity with other top singers. For example, BTS is not connected to any other artist in this network, which means people who like to listen to BTS’s songs do not tend to listen to songs from other singers in the top ranking. We have 15 such “lonely” singers, while the others form several clusters, shown with different colors in the network. Users who prefer Ed Sheeran’s songs may also like listening to Taylor Swift or James Arthur. Maluma, Ozuna, Lunay, Jhay Cortez and Dalex share relatively high similarity with one another. Another interesting finding is that some singers are the “bridge” among multiple clusters. For example, Benny Blanco is the node that links three clusters together. People who listen to his songs also tend to like singers in the linked clusters.
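The "lonely" singers and the clusters correspond to the connected components of the network. The sketch below finds them with a depth-first search. It is an illustration, not the project's code: the edge list is a toy example loosely based on the artists named above, and "Someone Else" is a made-up artist outside the top list.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components, ignoring edges that leave the node set."""
    neighbours = {node: set() for node in nodes}
    for a, b in edges:
        if a in neighbours and b in neighbours:  # keep only edges between top artists
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbours[node] - seen)
        components.append(component)
    return components

# Toy network, invented for illustration only.
artists = ["Ed Sheeran", "Taylor Swift", "James Arthur", "Maluma", "Ozuna", "BTS"]
related = [
    ("Ed Sheeran", "Taylor Swift"),
    ("Ed Sheeran", "James Arthur"),
    ("Maluma", "Ozuna"),
    ("Ozuna", "Someone Else"),  # dropped: not in the top list
]
components = connected_components(artists, related)
lonely = sorted(next(iter(c)) for c in components if len(c) == 1)
print(lonely)
```

A "bridge" artist such as the one described above would show up as a node whose removal splits one component into several.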
Another interesting component of songs is the lyrics. We are interested in what different types of songs are talking about, and thus we explored this topic by visualizing the frequent words in word cloud format.
We have created an interactive word cloud (available in Shiny App) in which users have control over what songs to visualize by selecting ranges of three song features, i.e. danceability, energy, and loudness. Some other song features are not included due to their irrelevance to lyrics theme. For example, instrumentalness is not included since songs with different instrumentalness naturally have difference in lyrics lengths instead of lyrics content.
By playing around with the sliders, we have some interesting findings. For example, if we change the danceability slider to high and low, we can get the two clouds respectively as below.
(Note: Only the first cloud is rendered directly by wordcloud2. Due to issues with this package (see SO post), the other 5 clouds were captured with webshot and rendered in png format.)
We can see that many of the frequent words (with larger sizes) are swear words, which is very different from songs with lower danceability (shown later). We suspect that the selected songs are mostly hip-hop songs, since they naturally have higher danceability. Thus, our hypothesis is that hip-hop songs are more likely to contain profanities than others, which we examine in later sections.
When selecting songs with lower danceability, the resulting frequent words in the cloud are gentler and somewhat story-telling, such as “heart” and “love”. One possibility is that many songs with lower danceability tend to be softer, with romantic themes.
Thus, we can see that songs with different feature ranges indeed have different content.
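Behind each word cloud is a word-frequency table for the songs that fall inside the selected feature range. A minimal Python sketch of that filtering and counting step is shown below. The stop-word list, the record layout and the lyrics are all invented for illustration, and the actual clouds in this report were rendered in R.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "and", "i", "you", "to", "it", "in", "my", "me"}

def top_words(songs, feature, low, high, n=5):
    """Count words in the lyrics of songs whose feature value lies in [low, high]."""
    counts = Counter()
    for song in songs:
        if low <= song[feature] <= high:
            words = re.findall(r"[a-z']+", song["lyric"].lower())
            counts.update(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)

# Invented records standing in for the lyrics dataset joined with audio features.
songs = [
    {"danceability": 0.9, "lyric": "dance dance all night"},
    {"danceability": 0.3, "lyric": "my heart, my love, my heart"},
]
print(top_words(songs, "danceability", 0.0, 0.5))
```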
In order to further explore the differences in lyrics for different song themes, we have plotted the word clouds for four different playlists, including:
Christmas playlist
Halloween playlist
hip-hop playlist
Romance playlist
By simple comparison, we can see that different song themes indeed have different lyrics content. As expected, Christmas songs have frequent words such as “bells”, “santa”, and “snow”. Similarly, for the Halloween playlist, we can see frequent words including “monster” and “night”, which are indeed horror-related.
Songs with a romantic theme tend to have soft and sweet words such as “love”, “baby”, “heart” and “kiss”, similar to the songs with lower danceability shown above. In contrast, the hip-hop songs are again full of bad language, which is quite similar to what we discovered in songs with high danceability. This supports our hypothesis that songs with higher danceability are likely to be hip-hop songs containing a large number of swear words.
In this project, we tried to find out the relationship between popular songs and time, features, singers and lyrics based on the daily data from Spotify. There are many interesting findings.
The total streams change over time. People listen to songs more on Friday than on Sunday, and streams peak on holidays like Christmas. Beyond the overall trend, popular songs do not follow one single popularity trend. Looking into song features, we found no clear relationship between a song’s features and its ranking. Hence, producers do not need to chase a trend to produce a popular song. However, the preference for features varies across time and countries. The popular singers tend to maintain a recognizable style, and singers with similar styles are connected with each other. Finally, the word clouds of lyrics indicated that the content of songs differs when the features take different values.
All in all, this report gives readers more insights on the songs. We hope people could enjoy music more after reading this report! Hooray!